The Importance of The New York Times’s Lawsuit Against OpenAI, Microsoft, Explained

Adriana Marais | December 29, 2023 | OpenAI

The New York Times has sued OpenAI and Microsoft over the use of copyrighted data, becoming the first major media company to do so.

NYT claims in the lawsuit that the large language models used by OpenAI and Microsoft for generative AI were created “by copying and using millions of The Times’ copyrighted news articles, in-depth studies, opinion pieces, reviews and guides, and more.”

This legal battle could set a precedent for how courts determine the value of news content in training large language models and what damages are from prior use.

Shaunt Sarkissian, founder and CEO of AI-ID, an AI tracking, authentication, source validation and source data management platform, told PYMNTS that he thinks the lawsuit is a warning to all platforms about how they have trained their models, and also about how they label the data coming out and package it in a way that lets them compensate the organizations behind the training data.

“The era of the free ride is over,” he added.

The lawsuit opens a new front in the years-long dispute between technology and media companies over the internet’s economy, pitting one of the news industry’s most powerful players against the pioneers of a new wave of artificial intelligence tools. It comes after months of business talks between the two companies ended without a deal, as reported by the Times.

The Times has requested a jury trial in the lawsuit, which was filed in US federal court in the Southern District of New York.

What Is The Times’ Complaint?

The lawsuit alleges, among several examples, that Microsoft’s “Browse with Bing” feature reproduced content from Wirecutter, The Times’ product recommendation platform, through significant verbatim and direct copying. Additionally, the lawsuit accuses OpenAI’s GPT-4 of falsely attributing recommendations to Wirecutter.

“We respect the rights of content creators and owners and are committed to working with them to ensure they benefit from AI technology and new revenue models,” an OpenAI spokesperson told Axios in a statement.

“Our ongoing discussions with the New York Times have been productive and constructive, so we are surprised and disappointed by this development. We hope we can find a mutually beneficial way to work together, as we do with many other publishers.”

Tech companies developing generative AI tools have often argued that content available on the open Internet can be used to train their technology under a legal term known as “fair use,” which allows copyrighted material to be used without permission under certain circumstances.

“The New York Times placed a very strong stake in the ground, demonstrating the value and importance of protecting news content,” Danielle Coffey, CEO of the News/Media Alliance, a trade group for news publishers, told the Wall Street Journal. “Quality journalism and these new technologies, especially those competing for the same audience, can complement each other if approached collaboratively.”

Basic problem

ChatGPT is a large language model (LLM) built on the Generative Pre-trained Transformer architecture. Trained on large datasets, it learns grammar, context, and language patterns. The model is “generative,” meaning it produces human-like text. Pre-training exposes it to a wide range of language examples, allowing it to predict the next word. Fine-tuning then customizes the model for specific tasks.
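As a rough illustration of what “predicting the next word” means, the sketch below counts which word most often follows another in a tiny sample text. It is a toy bigram counter, far simpler than the transformer architecture used by ChatGPT; the function names and the sample corpus are made up for this example.

```python
from collections import Counter, defaultdict

def train_bigram_model(text):
    """Count, for every word, which words follow it in the training text."""
    words = text.lower().split()
    followers = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        followers[current][nxt] += 1
    return followers

def predict_next(model, word):
    """Return the most frequent follower of `word`, or None if the word is unseen."""
    candidates = model.get(word.lower())
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]

# Hypothetical training text; a real model is trained on vastly more data.
corpus = "the times sued openai and the times requested a jury trial"
model = train_bigram_model(corpus)

print(predict_next(model, "the"))      # most frequent word seen after "the"
print(predict_next(model, "lawsuit"))  # word never seen in the corpus
```

The more text such a model sees, the better its guesses become, which is why the volume and quality of training data matter so much in this dispute.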

Simply put, the training data is gathered by automated tools known as web crawlers and web scrapers, which are similar to the technology used to build search engines. Web crawlers can be described as virtual spiders that follow URL paths and systematically record the location of each item they encounter.
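The crawling idea can be sketched in a few lines. The example below is a minimal sketch that walks a small, hypothetical link graph held in memory instead of fetching real pages; the `PAGES` dictionary, the example.com URLs and the `crawl` function are all assumptions made for illustration, not any company’s actual crawler.

```python
from collections import deque

# Hypothetical link graph standing in for real pages: URL -> links found on that page.
PAGES = {
    "https://example.com/": ["https://example.com/news", "https://example.com/reviews"],
    "https://example.com/news": ["https://example.com/news/story-1", "https://example.com/"],
    "https://example.com/reviews": ["https://example.com/news/story-1"],
    "https://example.com/news/story-1": [],
}

def crawl(start_url, fetch_links):
    """Breadth-first crawl: visit each reachable URL once, in discovery order."""
    visited = []
    seen = {start_url}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        visited.append(url)  # a real crawler would download and store the page here
        for link in fetch_links(url):
            if link not in seen:  # skip pages that are already queued or visited
                seen.add(link)
                queue.append(link)
    return visited

order = crawl("https://example.com/", lambda url: PAGES.get(url, []))
for url in order:
    print(url)
```

A production crawler adds network requests, politeness delays and storage, but the core loop of following links and remembering what has been seen is the same.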

Those LLMs were fed millions of pieces of Times content, including subscription-protected content, multiple times to train the GenAI, the NYT has alleged in the suit.

OpenAI, like other big tech companies, has become less transparent about its training data. OpenAI used Common Crawl, a well-known source, in at least one version of the large language model behind ChatGPT. In contrast to the detailed information provided during the development of GPT-3, recent releases such as GPT-3.5 and GPT-4 offer limited insight into the training process and data used.

The latest OpenAI technical report explicitly attributes its lack of detail to the competitive landscape and security concerns around large models like GPT-4, stating that it does not include information on architecture, hardware, training methods, or datasets.

On the one hand, big tech companies like OpenAI and Microsoft lift protected content from online publications to feed into their LLMs without proper attribution or credit, claiming it is fair use; on the other hand, they charge users for the text those models generate, claiming it as their own. In effect, it is billions of dollars’ worth of plagiarized content that companies like OpenAI have turned into a for-profit business in a matter of years.

The AI models currently in use are so lucrative that companies like Microsoft have invested billions of dollars. Microsoft first invested $1 billion in OpenAI in 2019 and added at least $10 billion in January.

Although big tech is investing billions and making billions, online publishing has suffered. Online publications such as NYT, Reuters, BBC and CNN receive no attribution, credit or referrals, and there is no “fair profit sharing” model. In response, these companies have blocked the OpenAI crawler in an attempt to stop this appalling scale of data exploitation on their websites.
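Publishers typically block a crawler through a robots.txt file on their site. The sketch below uses Python’s standard `urllib.robotparser` module to show how such a rule is read. `GPTBot` is the user-agent token OpenAI documents for its crawler; the robots.txt content, the example.com URL and the `is_allowed` helper are hypothetical and are not taken from any publisher’s actual file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block OpenAI's crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

article = "https://example.com/2023/12/27/some-article.html"
gptbot_allowed = is_allowed(ROBOTS_TXT, "GPTBot", article)
other_allowed = is_allowed(ROBOTS_TXT, "SomeOtherBot", article)

print("GPTBot allowed:", gptbot_allowed)
print("Other crawler allowed:", other_allowed)
```

Note that robots.txt is a voluntary convention: it only works if the crawler chooses to honor it, and it does nothing about content collected before the rule was added.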

Comedian and author Sarah Silverman is part of a copyright infringement class action lawsuit against OpenAI and Meta. Content creators in various fields, including writers, musicians, artists, and photographers, are grappling with the potential impact of generative AI technology on their fields and how to secure their creative works.

Author groups also filed at least two lawsuits against OpenAI this year, accusing the company of training AI with copyrighted works without their permission and using illegal copies of their books plucked from the Internet.

In August, the US Copyright Office launched an initiative to investigate the use of copyrighted material in AI training, indicating that legislative or regulatory action may be necessary in the near future to address the use of copyrighted material in AI model training datasets.

How Does NYT Make Money?

Media organizations like NYT have several sources of income. These include a subscription-based model, advertising across digital and print platforms, licensing revenue, affiliate referrals, building rental revenue, commercial printing, NYT Live (live events business) and retail.

Subscriptions have been NYT’s largest source of revenue for more than four years. In 2022, they accounted for 67% of total revenue: of the $2.3 billion in revenue, subscriptions, including both print and digital formats, generated $1.55 billion.

Advertising, which covers both print and digital media, generated $523 million, with an additional $232 million from other sources. Digital subscriptions played a significant role, generating more than $978 million, while print subscriptions accounted for $573 million of subscription revenue.
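The 2022 figures quoted above can be cross-checked with simple arithmetic. The short sketch below uses only the rounded numbers given in this article, in millions of US dollars, so the totals are approximate; the variable names are illustrative.

```python
# Figures in millions of US dollars, as quoted in this article for 2022 (rounded).
digital_subscriptions = 978
print_subscriptions = 573
advertising = 523
other = 232

subscriptions = digital_subscriptions + print_subscriptions
total = subscriptions + advertising + other
subscription_share = subscriptions / total

print(f"Subscription revenue: ${subscriptions} million")
print(f"Total revenue: ${total} million")
print(f"Subscription share: {subscription_share:.0%}")
```

The components add up to roughly $2.3 billion, with subscriptions making up about two-thirds of it, which matches the 67% share cited above.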

In addition to the generated income, NYT also employs 5,800 people. Alleged data mining by companies like OpenAI is a direct threat not only to billions of dollars in revenue, but also to the livelihood of the 5,800 employees working in the organization.

The damage doesn’t stop there: taking content available on the Internet and calling it your own threatens the creation of original content and discourages people from producing it. This is essentially an attack on the democratic way in which users and organizations publish content on the Internet.

Also, while NYT tries to protect itself with a subscription model, there are many other online repositories that are free and can be accessed by AI company crawlers. Publicly available information on the Internet includes a variety of sources, including images on Flickr, online marketplaces, voter registration records, government websites, business environments, employee profiles, Wikipedia, Reddit, research repositories, and free news platforms.

Additionally, there is plenty of readily available unauthorized content and archived collections that may contain deleted personal blogs or other embarrassing content from the past.

In April, The Washington Post conducted an analysis that revealed that one data set used to train AI spans nearly the entire 30-year history of the Internet. Tech companies have extensively scraped this data in an effort to increase the parameters of their models to billions and even trillions so they can improve the training of their AI models.

Additionally, there is nothing stopping companies like OpenAI from starting their own online publications and using generative AI to create content and openly publishing content that has been lifted and plagiarized from any number of sources available to it.

The lives of artists and creatives are also being disrupted by the ability of AI tools to create content, including written material and images, with tools like OpenAI’s DALL-E. This advanced set of tools poses a tangible challenge to the income of working artists, prompting them to urgently seek methods to protect their creations from being included in the datasets used to train AI tools.

A study conducted in August revealed that shortly after ChatGPT was launched, it had a detrimental effect on the employment prospects and income of online freelancers such as copywriters and graphic designers. The effect was particularly strong among highly skilled freelancers who performed numerous tasks and earned substantial incomes.

The AI revolution

In just over a year, generative AI has brought the technology industry back from the brink. Around November of last year, the tech industry was hit with mass layoffs, falling profits, plummeting stock prices at major tech companies, market capitalizations dropping by billions, and a host of other issues typical of a declining sector. Then, on November 30th, OpenAI launched an experimental chatbot called ChatGPT.

A year after ChatGPT’s public debut, the excitement surrounding AI is still high. Tech giants have invested billions in the technology, and countries are stockpiling the chips needed for future AI efforts. In just two months, ChatGPT achieved unprecedented growth and became the fastest-growing consumer app ever. By January, it was estimated to have amassed 100 million active users, sparking an AI arms race among companies and reinvigorating the tech sector.

Microsoft has invested nearly $13 billion in OpenAI, Google is investing nearly $8 billion every quarter in Bard, and Meta and X, formerly Twitter, are investing billions in AI models.

According to the Organization for Economic Co-operation and Development (OECD), 21% of global venture capital investments in 2020 (the latest compilation) went into AI, an estimated $75 billion.

With such massive investments behind consumer generative AI, these companies could potentially drive publications out of business by relying on the nuanced legal concept of “fair use,” which was meant to let individual users create content from material available on the Internet, not to let tech giants with billions in backing put publishers out of business.

A fair model of profit sharing?

The multitude of generative AI chatbots built on large language models are here to stay, but the unauthorized use of copyrighted data cannot be allowed. What is needed is a “fair profit sharing” model that encourages users to create, and keep creating, original content.

In the absence of adequate safeguards, AI companies risk jeopardizing the business of the news organizations they depend on to train their algorithms. This scenario poses an existential threat to both AI companies and news organizations in the long run.

Some organizations, such as the Associated Press and Axel Springer, have already entered into commercial agreements to license content to OpenAI. Under these agreements, news companies are compensated in exchange for permission to use their content to train large language models.

News media executives are skeptical of technology companies based on their experiences of the past decade. Although Google and Facebook initially helped publishers expand their audience and increase online traffic, they became formidable competitors for online advertising revenue.

These tech giants had the power to influence the growth or decline of news traffic through algorithmic changes. Publishers who were unable to secure a fair share of the Internet’s significant growth enabled by search and social media are now reluctant to face a similar fate in the field of artificial intelligence.

What next?

Although the lawsuit seeks a jury trial, it does not specify a monetary amount. Nevertheless, the complaint asserts that Microsoft and OpenAI should be liable for “billions of dollars in statutory and actual damages.”

Over the past decade, news publishers have actively sought congressional protection against Big Tech companies that have used their content to drive engagement on social media and search engines. The emergence of artificial intelligence has prompted a new lobbying initiative from news executives, who argue that tech companies have no right to take their content under the fair use provisions of existing copyright law.

“This case probably sets the benchmark for what is the economic threshold or what are reasonable royalties for fair use of content,” Sarkissian said. “Everyone is using The New York Times as a proxy and seeing how it goes.”